Iterative Compilation and Performance Prediction for Numerical Applications
Institute for Computing Systems Architecture

As the current rate of improvement in processor performance far exceeds the rate
of memory performance, memory latency is the dominant overhead in many
performance critical applications. In many cases, automatic compiler-based
approaches to improving memory performance are limited and programmers
frequently resort to manual optimisation techniques. However, this process is tedious
and time-consuming. Furthermore, the diverse range of rapidly evolving hardware
makes the optimisation process even more complex. It is often hard to predict the
potential benefits of different optimisations, and there are no simple criteria for
stopping optimisation, i.e. for determining when optimal memory performance has
been achieved or sufficiently approached.
This thesis presents a platform-independent optimisation approach for numerical
applications based on iterative, feedback-directed program restructuring, using a
new, reasonably fast and accurate performance prediction technique to guide
optimisations. New strategies for searching the optimisation space, by means of
profiling to find the best possible program variant, have been developed. These
strategies have been evaluated using a range of kernels and programs on different
platforms and operating systems. Significant performance improvements have been
achieved with the new approaches when compared to state-of-the-art native static
and platform-specific feedback-directed compilers.
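The iterative, prediction-guided search described above can be sketched as an enumeration of a small optimisation space ranked by a cost model. Everything below is invented for illustration: the flags, the analytic "predictor", and its numbers stand in for a real performance-prediction technique.

```python
import itertools

# Hypothetical performance predictor: estimated runtime (seconds) of a
# program variant as a function of which optimisations are enabled.
# The numbers are purely illustrative, not measured.
def predicted_runtime(flags):
    t = 10.0
    if flags["unroll"]:
        t -= 1.5
    if flags["tile"]:
        t -= 2.0
        if flags["block"]:
            t -= 0.5   # blocking only pays off on top of tiling
    return t

def iterative_search(options):
    """Enumerate every variant in the (small) optimisation space and
    keep the one with the lowest predicted runtime."""
    best_flags, best_time = None, float("inf")
    keys = sorted(options)
    for values in itertools.product(*(options[k] for k in keys)):
        flags = dict(zip(keys, values))
        t = predicted_runtime(flags)
        if t < best_time:
            best_flags, best_time = flags, t
    return best_flags, best_time

space = {"unroll": [False, True], "tile": [False, True], "block": [False, True]}
best, t = iterative_search(space)
```

A real system would replace exhaustive enumeration with a search strategy and replace the analytic model with profiling-based prediction, but the feedback loop has this shape.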
Autotuning Algorithmic Choice for Input Sensitivity
Empirical autotuning is increasingly being used in many domains to achieve optimized performance in a variety of different execution environments. A daunting challenge faced by such autotuners is input sensitivity, where the best autotuned configuration may vary with different input sets. In this paper, we propose a two-level solution that: first, clusters to find input sets that are similar in the input feature space; then, uses an evolutionary autotuner to build an optimized program for each of these clusters; and, finally, builds an adaptive, overhead-aware classifier which assigns each input to a specific input-optimized program. Our approach addresses the complex trade-off between using expensive features, which accurately characterize an input, and cheaper features, which can be computed with less overhead. Experimental results show that by adapting to different inputs one can obtain up to a 3x speedup over using a single configuration for all inputs.
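The cluster-then-dispatch scheme above can be sketched with toy data: inputs characterised by a single cheap feature, grouped by a minimal k-means-style loop, and each new input routed to the configuration tuned for its cluster. The feature values, cluster count, and variant names are all invented.

```python
def kmeans_1d(xs, k, iters=20):
    """Tiny 1-D k-means: seed centroids from the sorted data, then
    alternate assignment and centroid update."""
    centroids = sorted(xs)[:: max(1, len(xs) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for x in xs:
            i = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            groups[i].append(x)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def dispatch(x, centroids, configs):
    """Assign an input to the program variant tuned for its cluster."""
    i = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
    return configs[i]

features = [1, 2, 3, 100, 110, 120]        # e.g. cheap input-size feature
centroids = kmeans_1d(features, k=2)
configs = ["small-input-variant", "large-input-variant"]
choice = dispatch(105, centroids, configs)
```

The paper's classifier additionally weighs feature-extraction cost against classification accuracy; this sketch uses only the single cheap feature.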
Towards an Achievable Performance for the Loop Nests
Numerous code optimization techniques, including loop nest optimizations,
have been developed over the last four decades. Loop optimization techniques
transform loop nests to improve the performance of the code on a target
architecture, including exposing parallelism. Finding and evaluating an
optimal, semantic-preserving sequence of transformations is a complex problem.
The sequence is guided using heuristics and/or analytical models and there is
no way of knowing how close it gets to optimal performance or if there is any
headroom for improvement. This paper makes two contributions. First, it uses a
comparative analysis of loop optimizations/transformations across multiple
compilers to determine how much headroom may exist for each compiler. And
second, it presents an approach to characterize the loop nests based on their
hardware performance counter values and a Machine Learning approach that
predicts which compiler will generate the fastest code for a loop nest. The
prediction is made for both auto-vectorized, serial compilation and for
auto-parallelization. The results show that the headroom for state-of-the-art
compilers ranges from 1.10x to 1.42x for the serial code and from 1.30x to
1.71x for the auto-parallelized code. These results are based on the Machine
Learning predictions.

Comment: Accepted at the 31st International Workshop on Languages and
Compilers for Parallel Computing (LCPC 2018).
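The second contribution, predicting the fastest compiler for a loop nest from its hardware performance counters, can be sketched as a nearest-neighbour classifier. The counter values, feature choice, and labels below are invented; the paper's actual model and feature set are not reproduced here.

```python
import math

# Hypothetical training set: (counter feature vector, fastest compiler).
# Features might be e.g. cache-miss rate and branch-miss rate; the
# numbers and labels are made up for illustration.
train = [
    ((0.02, 0.01), "gcc"),
    ((0.30, 0.02), "icc"),
    ((0.25, 0.15), "clang"),
]

def predict_compiler(features):
    """1-nearest-neighbour prediction of which compiler will generate
    the fastest code for a loop nest with these counter values."""
    return min(train, key=lambda t: math.dist(t[0], features))[1]

winner = predict_compiler((0.28, 0.03))
```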
Predictive runtime code scheduling for heterogeneous architectures
Heterogeneous architectures are currently widespread. With the advent of
easy-to-program general-purpose GPUs, virtually every recent desktop computer
is a heterogeneous system. Combining the CPU and the GPU brings great amounts
of processing power. However, such architectures are often used in a
restricted way for domain-specific applications like scientific applications
and games, and they tend to be used by a single application at a time. We
envision future heterogeneous computing systems where all their heterogeneous
resources are continuously utilized by different applications with versioned
critical parts to be able to better adapt their behavior and improve execution
time, power consumption, response time and other constraints at runtime. Under
such a model, adaptive scheduling becomes a critical component.
In this paper, we propose a novel predictive user-level scheduler based on
past performance history for heterogeneous systems. We developed several
scheduling policies and present a study of their impact on system performance.
We demonstrate that such a scheduler allows multiple applications to fully
utilize all available processing resources in CPU/GPU-like systems and
consistently achieve speedups ranging from 30% to 40% compared to using the
GPU alone in single-application mode.
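A history-based predictive scheduler of the kind proposed above can be sketched as a per-(kernel, device) running average of observed execution times, with each new task dispatched to the device predicted fastest. Device names, kernel names, and timings are invented.

```python
from collections import defaultdict

class HistoryScheduler:
    """Toy predictive scheduler: predicts each device's time for a
    kernel from its recorded history and picks the fastest."""

    def __init__(self, devices):
        self.devices = devices
        self.history = defaultdict(list)   # (kernel, device) -> observed times

    def record(self, kernel, device, seconds):
        self.history[(kernel, device)].append(seconds)

    def predict(self, kernel, device):
        h = self.history[(kernel, device)]
        # Optimistic 0.0 for unseen pairs, so every device gets tried once.
        return sum(h) / len(h) if h else 0.0

    def schedule(self, kernel):
        return min(self.devices, key=lambda d: self.predict(kernel, d))

sched = HistoryScheduler(["cpu", "gpu"])
sched.record("matmul", "cpu", 4.0)
sched.record("matmul", "gpu", 1.5)
```

The paper's scheduler layers several policies on top of this idea (e.g. handling contention and multiple concurrent applications); this shows only the history-driven prediction core.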
Milepost GCC: Machine Learning Enabled Self-tuning Compiler
Tuning compiler optimizations for rapidly evolving hardware makes porting and extending an optimizing compiler for each new platform extremely challenging. Iterative optimization is a popular approach to adapting programs to a new architecture automatically using feedback-directed compilation. However, the large number of evaluations required for each program has prevented iterative compilation from widespread take-up in production compilers. Machine learning has been proposed to tune optimizations across programs systematically but is currently limited to a few transformations, long training phases and, critically, lacks publicly released, stable tools. Our approach is to develop a modular, extensible, self-tuning optimization infrastructure to automatically learn the best optimizations across multiple programs and architectures based on the correlation between program features, run-time behavior and optimizations. In this paper we describe Milepost GCC, the first publicly available open-source machine-learning-based compiler. It consists of an Interactive Compilation Interface (ICI) and plugins to extract program features and exchange optimization data with the cTuning.org open public repository. It automatically adapts the internal optimization heuristic at function-level granularity to improve execution time, code size and compilation time of a new program on a given architecture. Part of the MILEPOST technology, together with the low-level ICI-inspired plugin framework, is now included in the mainline GCC. We developed machine learning plugins based on probabilistic and transductive approaches to predict good combinations of optimizations. Our preliminary experimental results show that it is possible to automatically reduce the execution time of individual MiBench programs, some by more than a factor of 2, while also improving compilation time and code size.
On average, we are able to reduce the execution time of the MiBench benchmark suite by 11% for the ARC reconfigurable processor. We also present a realistic multi-objective optimization scenario for the Berkeley DB library using Milepost GCC, improving execution time by approximately 17% while reducing compilation time and code size.
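The feature-based prediction at the heart of the approach above can be sketched as follows: each previously-seen program is represented by a static feature vector, and a new program inherits the flag combination that worked best for its nearest neighbour. The feature vectors, program names, and flag sets below are all invented for illustration; Milepost's actual models are probabilistic and transductive rather than this plain nearest-neighbour lookup.

```python
import math

# Hypothetical repository of (program features, best-known flags).
seen = {
    "crc32":    ((0.9, 0.1, 0.0), ["-O3", "-funroll-loops"]),
    "dijkstra": ((0.2, 0.7, 0.1), ["-O2", "-ftree-vectorize"]),
    "bitcount": ((0.1, 0.2, 0.7), ["-O3", "-ffast-math"]),
}

def suggest_flags(features):
    """Return the flag combination of the nearest neighbour in
    program-feature space."""
    nearest = min(seen.values(), key=lambda fv: math.dist(fv[0], features))
    return nearest[1]

flags = suggest_flags((0.85, 0.15, 0.0))   # profile resembles crc32
```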
MLSys: The New Frontier of Machine Learning Systems
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
MedPerf: Open Benchmarking Platform for Medical Artificial Intelligence using Federated Evaluation
Medical AI has tremendous potential to advance healthcare by supporting the evidence-based practice of medicine, personalizing patient treatment, reducing costs, and improving provider and patient experience. We argue that unlocking this potential requires a systematic way to measure the performance of medical AI models on large-scale heterogeneous data. To meet this need, we are building MedPerf, an open framework for benchmarking machine learning in the medical domain. MedPerf will enable federated evaluation, in which models are securely distributed to different facilities for evaluation, thereby empowering healthcare organizations to assess and verify the performance of AI models in an efficient and human-supervised process, while prioritizing privacy. We describe the current challenges healthcare and AI communities face, the need for an open platform, the design philosophy of MedPerf, its current implementation status, and our roadmap. We call for researchers and organizations to join us in creating the MedPerf open benchmarking platform.
Collective Optimization
Iterative compilation is an efficient approach to optimizing programs on rapidly evolving hardware, but it is still only scarcely used in practice due to the need to gather a large number of runs, often with the same data set and in the same environment, in order to test many different optimizations and to select the most appropriate ones. Naturally, in many cases, users cannot afford a training phase, will run each data set once, develop new programs which are not yet known, and may regularly change the environment the programs are run on. In this article, we propose to overcome that practical obstacle using Collective Optimization, where the task of optimizing a program leverages the experience of many other users, rather than being performed in isolation, and often redundantly, by each user. Collective optimization is an unobtrusive approach, where performance information obtained after each run is sent back to a central database, which is then queried for optimization suggestions, and the program is then recompiled accordingly. We show that it is possible to learn across data sets, programs and architectures in non-dynamic environments using static function cloning and run-time adaptation, without even a reference run to compute speedups over the baseline optimization. We also show that it is possible to simultaneously learn and improve performance, since there are no longer two separate training and test phases, as in most studies. We demonstrate that extensively relying on competition among pairs of optimizations (program reaction to optimizations) provides a robust and efficient method for capturing the impact of optimizations, and for reusing this knowledge across data sets, programs and environments. We implemented our approach in GCC and will publicly disseminate it in the near future.
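The central-database idea above can be sketched as a repository that accumulates (program, optimization, speedup) observations from many users' runs and suggests the optimization with the best mean observed speedup. Program names, flag strings, and speedup numbers are invented; the real system's reaction-based competition between optimization pairs is richer than this mean-speedup ranking.

```python
from collections import defaultdict

class CollectiveDB:
    """Toy collective-optimization repository: users report observed
    speedups, and the database suggests the best-known optimization."""

    def __init__(self):
        self.obs = defaultdict(list)   # (program, optimization) -> speedups

    def report(self, program, opt, speedup):
        self.obs[(program, opt)].append(speedup)

    def suggest(self, program):
        candidates = {opt: vals for (p, opt), vals in self.obs.items()
                      if p == program}
        if not candidates:
            return None   # nothing known yet for this program
        return max(candidates,
                   key=lambda o: sum(candidates[o]) / len(candidates[o]))

db = CollectiveDB()
db.report("susan", "-O3 -funroll-loops", 1.12)
db.report("susan", "-O3 -funroll-loops", 1.08)
db.report("susan", "-O2", 1.00)
```

Each recompilation then simply applies the suggestion, so learning and use happen in the same stream of production runs, with no separate training phase.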